fix(mcp): resolve SSE connection deadlock with asyncio.Lock #909

Closed

steveant wants to merge 1 commit into coleam00:main from AeyeOps:fix/mcp-asyncio-lock-deadlock

Conversation


@steveant steveant commented Dec 20, 2025

Summary

Replace threading.Lock with asyncio.Lock in the MCP server lifespan manager to prevent event loop blocking when multiple SSE connections are made concurrently.

Problem

The original code used threading.Lock, which blocks the entire event loop thread while waiting to acquire. Because the context was yielded while the lock was still held, other connections could not proceed, and the server became unresponsive after two or more concurrent SSE sessions.
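The failure mode can be reproduced in isolation. In this sketch (a simplified stand-in for the lifespan code, not the actual module), one coroutine holds a threading.Lock across an await point; any other coroutine that then tries a blocking acquire freezes the whole event loop, so the holder can never resume and release. A timeout is added so the demo terminates instead of deadlocking:

```python
import asyncio
import threading

lock = threading.Lock()

async def holder() -> None:
    lock.acquire()
    try:
        # Simulates yielding/awaiting while holding the lock,
        # as the old lifespan code effectively did.
        await asyncio.sleep(0.2)
    finally:
        lock.release()

async def contender() -> bool:
    # This blocking acquire runs on the event loop thread, freezing the
    # loop; holder's sleep can never complete, so the lock is never
    # released and the acquire times out.
    return lock.acquire(timeout=0.5)

async def main() -> bool:
    h = asyncio.create_task(holder())
    await asyncio.sleep(0.05)  # let holder take the lock first
    got = await contender()
    await h
    return got

blocked = asyncio.run(main())
print("contender acquired lock:", blocked)
```

With a real SSE endpoint there is no timeout, so the second connection simply hangs, which matches the reported symptom.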

Solution

  • Use asyncio.Lock (yields to other coroutines instead of blocking)
  • Yield context outside the lock block
  • Capture context in local variable to prevent race conditions
  • Add explicit None guard before yield
  • Add documentation comments explaining design decisions

Testing

Tested with 20 concurrent connections:

  • All requests completed successfully (< 0.5s total)
  • Single initialization confirmed via logs
  • No deadlock or timeout observed

Notes

  • asyncio.Lock is event-loop bound and therefore safe for a single Uvicorn worker (FastMCP's intended deployment)
  • Synchronous init calls inside the lock don't cause a deadlock, but they may delay concurrent connections under heavy load

Summary by CodeRabbit

  • Refactor
    • Enhanced internal initialization handling to improve startup reliability under concurrent connection scenarios.


Replace threading.Lock with asyncio.Lock in lifespan manager to prevent
event loop blocking when multiple SSE connections are made concurrently.

Problem:
- threading.Lock blocks the entire event loop thread
- Yielding while holding the lock prevented other connections
- Server became unresponsive after 2+ concurrent SSE sessions

Solution:
- Use asyncio.Lock (yields to other coroutines instead of blocking)
- Yield context outside the lock block
- Capture context in local variable to prevent race conditions
- Add explicit None guard before yield

Tested with 20 concurrent connections - all passed, single init confirmed.

coderabbitai Bot commented Dec 20, 2025

Walkthrough

The initialization synchronization mechanism was refactored to replace a threading-based lock with an asyncio-compatible lock. The lifespan initialization flow now uses an async context manager with a double-check pattern, captures the shared context in a local variable to avoid race conditions, and yields it outside the lock so active connections do not block others.

Changes

  • Async initialization lock refactor (python/src/mcp_server/mcp_server.py): Added asyncio import; replaced threading.Lock() with asyncio.Lock() for the module-level _initialization_lock; updated lifespan initialization to use an async context manager with a double-check pattern; introduced a local ctx variable to capture the ArchonContext and prevent race conditions; moved the context yield outside the lock, with a guard that raises on initialization failure; maintained shared context reuse and added cleanup logging.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12–18 minutes

  • Verify that asyncio.Lock() is the correct synchronization primitive for the async context
  • Confirm the double-check pattern correctly prevents race conditions without introducing new ones
  • Ensure the local ctx variable properly captures the ArchonContext and that the guard condition raises appropriately on initialization failure
  • Check that moving the yield outside the lock does not reintroduce timing vulnerabilities

Poem

🐰 A threading lock, so old and bold,
Gave way to async, swift and true—
With double-checks and context caught,
No race conditions slip right through!
The shared realm waits outside the gate,
Where locks release and tasks align. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Title check: Passed. The title accurately and concisely summarizes the main change: replacing threading.Lock with asyncio.Lock to fix an SSE connection deadlock.
  • Description check: Passed. The description covers the main sections (Summary, Problem, Solution, Testing, Notes) but is missing several template sections, such as Type of Change, Affected Services, Testing checklist items, and formal checklist verification.
  • Docstring coverage: Passed. Docstring coverage is 100.00%, which exceeds the required threshold of 80.00%.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
python/src/mcp_server/mcp_server.py (1)

163-176: Synchronous calls inside async lock may delay concurrent connections.

The comments correctly note that synchronous calls to get_session_manager() and get_mcp_service_client() inside the lock don't cause deadlock (asyncio.Lock yields control), but they may delay concurrent connection attempts during initialization.

For current usage this is acceptable, but if these become I/O-bound bottlenecks under high concurrent load, consider wrapping them with asyncio.to_thread() or loop.run_in_executor().

🔎 Optional optimization if initialization becomes a bottleneck

If profiling shows these sync calls are slow:

                # Initialize session manager
-                # NOTE: These sync calls run inside the lock. They don't cause deadlock
-                # (asyncio.Lock yields to other coroutines), but slow init here delays
-                # concurrent connection attempts. Consider run_in_executor if these
-                # become I/O-bound bottlenecks under load.
                logger.info("🔐 Initializing session manager...")
-                session_manager = get_session_manager()
+                session_manager = await asyncio.to_thread(get_session_manager)
                logger.info("✓ Session manager initialized")

                # Initialize service client for HTTP calls
                logger.info("🌐 Initializing service client...")
-                service_client = get_mcp_service_client()
+                service_client = await asyncio.to_thread(get_mcp_service_client)
                logger.info("✓ Service client initialized")

Note: Only apply if these functions are I/O-bound and measurements show they're bottlenecks.
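For reference, asyncio.to_thread offloads a blocking call to the default thread-pool executor, so the event loop keeps serving other coroutines while it runs. A small self-contained illustration (slow_sync_init is a hypothetical stand-in for a blocking initializer such as get_session_manager):

```python
import asyncio
import time

def slow_sync_init() -> str:
    # Stand-in for a blocking initializer (e.g. sync I/O or setup work)
    time.sleep(0.2)
    return "ready"

async def main() -> tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Keeps running only if the event loop is not blocked
        nonlocal ticks
        while True:
            await asyncio.sleep(0.05)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    # The blocking call runs in a worker thread; the loop stays responsive
    result = await asyncio.to_thread(slow_sync_init)
    hb.cancel()
    return result, ticks

result, ticks = asyncio.run(main())
print(result, "heartbeats during init:", ticks)
```

The heartbeat counter advancing during the 0.2 s init shows the loop was never blocked; calling slow_sync_init() directly in the coroutine would leave the counter at zero.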

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecaece4 and 26beb62.

📒 Files selected for processing (1)
  • python/src/mcp_server/mcp_server.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
python/src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration

python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety

Files:

  • python/src/mcp_server/mcp_server.py
python/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations

Files:

  • python/src/mcp_server/mcp_server.py
🧠 Learnings (1)
📚 Learning: 2025-11-29T19:24:46.265Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: archon-example-workflow/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:46.265Z
Learning: Check if Archon MCP server is available before doing anything else in task management scenarios

Applied to files:

  • python/src/mcp_server/mcp_server.py
🧬 Code graph analysis (1)
python/src/mcp_server/mcp_server.py (1)
python/src/server/utils/progress/progress_tracker.py (1)
  • error (196-219)
🔇 Additional comments (6)
python/src/mcp_server/mcp_server.py (6)

17-17: LGTM: Asyncio import added correctly.

The asyncio import is necessary for asyncio.Lock() usage and is properly placed with other standard library imports.


144-148: LGTM: Optimistic check pattern is efficient.

The quick check without acquiring the lock (lines 144-148) optimizes the common case where initialization is already complete, allowing subsequent SSE connections to proceed immediately.

The explicit is not None check is good defensive programming.


150-158: LGTM: Double-check pattern correctly implemented.

The double-check locking pattern is properly implemented:

  1. Local variable ctx avoids race conditions between lock release and yield
  2. Type hint ArchonContext | None is explicit
  3. Async lock context manager is correct
  4. Double-check inside lock prevents duplicate initialization

185-192: LGTM: Atomic assignment and comprehensive error handling.

The context is assigned to _shared_context last (line 186) for atomicity, and the flag is set after successful initialization (line 187). Error handling properly logs the exception with full traceback per coding guidelines.


194-203: LGTM: Yielding outside lock prevents blocking—key fix for deadlock.

This is the critical fix: yielding the context outside the lock (line 200) ensures other concurrent connections aren't blocked while this connection is active.

The None guard (lines 197-198) is good defensive programming, though unlikely to trigger in normal operation since initialization either succeeds or raises.


67-70: Excellent fix addressing event-loop blocking with clear documentation.

The change from threading.Lock to asyncio.Lock correctly resolves the event-loop blocking issue that caused SSE connection deadlocks. The inline documentation clearly explains the event-loop binding constraint and deployment assumptions.

Single-worker deployment is enforced by default—neither the Dockerfile.mcp nor docker-compose.yml configuration specifies worker processes, so Uvicorn runs with a single worker. The code comment provides clear guidance for any future multi-worker scenarios.

steveant added a commit to AeyeOps/archon that referenced this pull request Dec 20, 2025
- Add supabase-network to archon-agent-work-orders service
  (required for local Supabase CLI connectivity)
- Fix MCP SSE deadlock by using asyncio.Lock instead of threading.Lock
  (PR coleam00#909 submitted upstream)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@steveant
Author

Superseded by #928 which consolidates this fix along with other improvements.

@steveant steveant closed this Jan 15, 2026